Summary paragraph:

Of the different ways of representing a multi-unit system, the one afforded by networks is
among the most elegant and general. Endowing a system with a network representation requires
defining nodes and the links connecting them. Often, physical or virtual relationships between
the elements of the system, e.g. anatomic brain fibres or hyperlinks between the pages
of a web site, constrain the way a link is defined. When such relationships are not clearly
apparent, functional links can still be built as long as time-evolving variables are associated
with each node, e.g. the time evolution of a stock price, or of brain activity in a given region.



We propose a third, novel, method which allows treating collections of isolated, possibly het- 
erogeneous, scalars, e.g. sets of biomedical tests, as networked systems. The method builds 
a network where each node represents a feature, while each pairing quantifies the deviation 
between those two features and the corresponding typical relationship between them within 
a studied population. Topological characteristics can then be used to extract important in- 
formation about the system. In particular, atypical or pathological conditions correspond to 
strongly heterogeneous networks, whereas typical or normative conditions are characterized 
by sparsely connected networks with homogeneous nodes. Insofar as the network representation
of each instance or subject is constructed with reference to the population to which it is
compared, this technique is by its very nature a difference seeker. We apply the method to
unveil the importance of specific genes in the response of a plant, Arabidopsis thaliana, to
osmotic stress. The most important genes turned out to be the nodes with highest centrality
in the reconstructed networks, such that, when they are knocked out, different phenotypes 
appear. We not only confirm known results, but also highlight important genes hitherto un- 
related to the osmotic stress response of the plant. 

Over the past years, complex networks have provided a valuable framework for the
analysis of a wealth of natural and man-made systems. Fields of application include genetics,
proteomics and metabolomics, the study of neurological diseases, transportation networks and
the World Wide Web. The success of this approach rests on its capacity to eliminate all
unnecessary details, while eliciting the relevant backbone of interactions in an elegant
mathematical form.



So far, two approaches have been used for the network representation of data sets. The
first one, which we here call structural network reconstruction, requires explicit information
on the connections between the system's constituent elements. An illustrative example is
a transportation network: if, for instance, nodes represent airports, links between pairs of them
are directly associated with flights operating the corresponding routes.

When such an explicit structure is missing, and yet knowledge on the temporal evolution 
of the system's elements is available, the alternative is the construction of functional networks. 
Here, a multivariate data set is mapped into a structured network, where nodes represent individual 
time series and links are established based on some metrics assessing a relationship between them. 
One example is the functional representation of the brain, obtained by subdividing the brain
volume into different regions, recording their activities, and establishing connections by means
of different metrics, from simple linear correlations up to causality measures.

There are, however, drawbacks affecting the applicability of functional networks. The first is
that identifying a significant function (among the vast set available) for the establishment of the
links might involve a large degree of subjectivity. The second is that a time series is generally
required for each node in order for the functional relationships to be evaluated, and this precludes
the use of such an approach when the system's elements are characterized by a single, static value.
Relevant examples include tissue and organic sample analysis, like blood analysis or spectrography;
genetic expression levels of individuals, without evolution through time; biomedical analysis
with neuro-imaging techniques; or social network analysis, when just a snapshot of the system
evolution is available.

We here introduce a third way of representing data sets as networks, which exploits the information
of a set of pre-labeled subjects to unveil the presence of reference relationships between nodes. The
starting point is a multi-feature description of subjects, e.g. a collection of medical measurements
or of genetic expression levels, and their affiliation to one or multiple pre-defined groups. Since
working with the complete data set may be unfeasible, we consider the projection of the data
onto all possible planes created by pairs of features - see Fig. 1(b). In these planes, different
methods (from simple linear correlations, up to more sophisticated data mining techniques) are
used to extract reference models, one for each group, accounting for the characteristics of the
subjects in it. When a new, unlabeled subject is considered, the deviation between its data and such
reference models is used to weight the link between the corresponding nodes - see Fig. 1(c). The
final result is therefore the creation of a network for each subject, where nodes represent features,
and links are weighted according to the deviation from reference models. See Methods for a more
detailed description of the whole procedure.

When just one class of subjects is available, the network topology is the result of the deviation
(from the Greek παρέγκλισις) of some features from the streams created by the data associated with all
other subjects. In analogy with the spirit of the doctrine of Democritus and Epicurus, we term such
a technique the parenclitic network representation of a system.

As an application, we analyze the genetic expression of the plant Arabidopsis thaliana under
osmotic stress, with the objective of identifying those genes orchestrating the response to
this specific condition. Expression levels have been obtained from the AtGenExpress project,
including information about the 1,922 genes composing the transcription factors of Arabidopsis
at six different moments of time (30 min, 1 h, 3 h, 6 h, 12 h and 24 h after the onset of the stress).
While the classical approach considers co-expression networks, the parenclitic approach, on
the contrary, focuses on those pairs of genes whose expressions depart from a reference model.
The two techniques are therefore strongly complementary: while the former focuses on similarities
between the evolutions of expression levels through time, the latter centers on differences.
Furthermore, the construction of a different network for each time step allows tracking the plant
response through time.

Fig. 2 shows an example of the obtained parenclitic networks. Specifically, Fig. 2(a)
represents the giant component of the network corresponding to 3 h. The color of links encodes
their weights, with green (red) shades indicating low (high) Z-Scores, and the size of nodes is
proportional to their α-centrality - see Methods for more details.

The resulting network topologies are characterized by a highly heterogeneous structure,
dominated by a small number of hubs - see Fig. 2(b). Such highly central nodes indicate that the
expression levels of the corresponding genes are in general correlated with those of their
neighbors, except at 3 h, when this correlation is broken. This suggests that hubs are performing some
specific task at this moment of time, and therefore that they are the main actors in regulating the
overall plant response.

In order to confirm this hypothesis, an in vivo screening has been performed, in which genes
corresponding to the most central nodes of each network have been knocked out, and the appearance
of some phenotype has been monitored by measuring the length of the root of each plant
- see Methods for more details. As an example, Fig. 3 reports the results obtained with seven
transgenic lines, i.e. seven groups of plants in which the expression of one gene has been
artificially suppressed. Specifically, Fig. 3(a) reports the mean length of roots for the seven lines,
compared to the expected root length in the wild type (i.e., the plant without genetic modifications,
black column). The figure shows that, in all seven examples, knocking down the
corresponding gene leads to a strongly abnormal development of the plant.

The complete results of the in vivo screening are reported in Fig. 4. For each one of the six
networks analyzed, Fig. 4 reports the number of genes already known to be relevant for the osmotic
response of the plant, and the number of previously unknown genes that have been successfully
tested. Thanks to the parenclitic network representation, 15 new genes have been identified,
previously unknown or considered unrelated to the response to osmotic stress - the full list is
reported in Table 1.

In conclusion, the parenclitic approach allows a network representation of those data sets that
neither have a physical background of connections, nor correspond to a temporal evolution.
By exploiting the data associated with a set of pre-labeled subjects, and by extracting a set
of reference models, it is possible to construct networks whose links represent the presence of
deviations from the expected relationships. The application of this methodology to Arabidopsis
thaliana genetic expression levels has allowed the identification of genes regulating the response
of the plant to osmotic stress, which were previously unknown in the literature. Besides its general
applicability to systems for which functional representations are not possible, it should be stressed
that the parenclitic approach also allows merging different data sources into a single network, e.g.
gene expression levels and blood tests. The creation of such heterogeneous individual networks
can open new doors to the understanding of the interactions between different aspects of the system
under study.

Methods 

Parenclitic network reconstruction 

Consider a set of $n$ systems, or subjects, $\{s_1, s_2, \ldots, s_n\}$, each one associated with one of $n_c$
pre-defined classes - the classes will be denoted by $\{c_1, c_2, \ldots, c_{n_c}\}$. For instance,
each system may represent a person, classified as healthy (or control) or suffering from some
disease. Each system $i$ is in turn identified by a vector of $n_f$ features $f^i = (f^i_1, f^i_2, \ldots, f^i_{n_f})$, so that
each system is represented by a point in an $n_f$-dimensional space.

The fundamental ansatz is that each class can be associated with a constraint in the feature space.
In other words, and following the previous example, we suppose that a relationship
$\mathcal{F}_{healthy}(f_1, f_2, \ldots, f_{n_f}) = 0$ defines the feature combination associated with a healthy subject, while
another relationship $\mathcal{F}_{disease}(f_1, f_2, \ldots, f_{n_f}) = 0$ defines the combination of features for the studied
disease. More generally, there will be $n_c$ different relationships of this type, one for each of the
$n_c$ classes.



In general, the exact expressions of the functions $\mathcal{F}$ are not accessible, either because
extracting them may be too complex (the computational cost may be too high), or because not
enough data are available. Instead, for each pair of features $i$ and $j$, the values corresponding to subjects
of a given class $c$ are used to create a projected constraint $\mathcal{F}_{i,j}(f_i, f_j) = 0$, modeling the
relationship expected in that plane for subjects belonging to that class (see Fig. 1(b)). Such models can
be obtained by several methods, for instance a polynomial fit, or more generally a data
mining method like Support Vector Machines or Artificial Neural Networks. For each unlabeled
subject, the distance between its position in that plane and the derived model is used to weight
the corresponding link between nodes $i$ and $j$ of the parenclitic network representation - see the
red dot and line in Fig. 1(b) and the resulting topology illustrated in Fig. 1(c). Notice that we thus
move from a feature representation (features of all subjects represented in a space) to a subject
representation, where one network is constructed for each subject, and nodes represent features.
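
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes a single reference class, a straight-line fit in each plane (one of the options mentioned in the text), and a link weight given by the subject's residual expressed in units of the standard deviation of the reference residuals. The function name `parenclitic_network` and the synthetic data are hypothetical.

```python
import numpy as np

def parenclitic_network(reference, subject):
    """Build a weighted adjacency matrix for one unlabeled subject.

    reference: array of shape (n_subjects, n_features), one class only.
    subject:   array of shape (n_features,), the unlabeled subject.
    W[i, j] is the absolute deviation of the subject from the linear model
    that predicts feature j from feature i, scaled by the spread of the
    reference residuals (an assumed normalization).
    """
    reference = np.asarray(reference, dtype=float)
    subject = np.asarray(subject, dtype=float)
    n_features = reference.shape[1]
    W = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            if i == j:
                continue
            x, y = reference[:, i], reference[:, j]
            # np.polyfit returns coefficients from highest degree down
            beta, alpha = np.polyfit(x, y, deg=1)
            residuals = y - (alpha + beta * x)
            sigma = residuals.std(ddof=1)
            if sigma < 1e-12:
                continue  # degenerate plane: leave the weight at zero
            deviation = subject[j] - (alpha + beta * subject[i])
            W[i, j] = abs((deviation - residuals.mean()) / sigma)
    return W

# Synthetic example: feature 1 is roughly twice feature 0 in the reference class.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
reference = np.column_stack(
    [x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)]
)
W_typical = parenclitic_network(reference, np.array([1.0, 2.0, 0.0]))
W_atypical = parenclitic_network(reference, np.array([1.0, -2.0, 0.0]))
print(W_typical[0, 1], W_atypical[0, 1])
```

The subject that respects the reference relationship obtains a light link between features 0 and 1, while the subject that violates it obtains a heavy one. The matrix is directed because the model predicting feature j from feature i generally differs from the reverse one.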



Representation of Arabidopsis stress response 

The parenclitic network reconstruction method has been applied to the problem of identifying
the genes responsible for the reaction of the Arabidopsis thaliana plant to external stresses.
The original data set corresponds to the AtGenExpress project, including expression levels of
22,620 genes under 8 different abiotic stresses (including cold, heat, drought, osmotic, salt, wounding
and UV-B light) and at six different moments of time (30 min, 1 h, 3 h, 6 h, 12 h and 24 h after
the onset of stress treatment). Of these, only the osmotic stress has been considered in this work,
and the analysis has been limited to the $n_f = 1{,}922$ genes composing the transcription factors of
Arabidopsis.

Each subject is the state of the plant at a given time step, and therefore we aim at creating
a network representing the genes with an abnormal expression at each time step. In other words,
when analyzing data at time $\tau$, we create the $n_f(n_f - 1)$ reference models $\{\mathcal{F}_{i,j} = 0\}$ with the data
corresponding to all other time steps, and we generate links according to the distance from that
reference.

Mathematically, given two gene expression levels $i$ and $j$, we define our reference model as:

$$\tilde{f}_j^{\tau} = \alpha_{ij} + \beta_{ij} f_i^{\tau}, \qquad (1)$$

$\tilde{f}_j^{\tau}$ being the expected value of gene $j$ at time $\tau$, $f_i^{\tau}$ the known expression level of gene $i$,
and $\alpha_{ij}$ and $\beta_{ij}$ two free model parameters. These two coefficients are calculated by means of a
linear fit of all values corresponding to other time steps, i.e., minimizing the error $\epsilon$ of the relation:

$$f_j^{t} = \alpha_{ij} + \beta_{ij} f_i^{t} + \epsilon, \qquad t \neq \tau. \qquad (2)$$

The distance between the expected value (given by the model, $\tilde{f}_j^{\tau}$) and the real value $f_j^{\tau}$
of gene $j$ is then used to weight the link connecting nodes $i$ and $j$ in the network. More specifically,
the weight of the link is the absolute value of the Z-Score of the distance

$$f_j^{\tau} - \tilde{f}_j^{\tau}.$$
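
As a worked illustration of the linear fit described above, the least-squares coefficients of a straight line have a standard closed form. The block below writes them for the pair of genes $i$ and $j$, with all sums running over the reference time steps $t \neq \tau$. The symbols $T$ (total number of time steps, six in this data set), $r_{ij}^{t}$, $\mu_r$ and $\sigma_r$ are introduced here only for illustration. In particular, normalizing by the mean and standard deviation of the reference residuals is an assumption, since the text does not state which distribution the Z-Score refers to.

```latex
% Means over the T - 1 reference time steps (T is an illustrative symbol)
\bar{f}_i = \frac{1}{T-1} \sum_{t \neq \tau} f_i^{t}, \qquad
\bar{f}_j = \frac{1}{T-1} \sum_{t \neq \tau} f_j^{t}

% Ordinary least-squares slope and intercept of the reference model
\beta_{ij} = \frac{\sum_{t \neq \tau} (f_i^{t} - \bar{f}_i)(f_j^{t} - \bar{f}_j)}
                  {\sum_{t \neq \tau} (f_i^{t} - \bar{f}_i)^2}, \qquad
\alpha_{ij} = \bar{f}_j - \beta_{ij} \bar{f}_i

% Residuals of the reference time steps
r_{ij}^{t} = f_j^{t} - \alpha_{ij} - \beta_{ij} f_i^{t}, \qquad t \neq \tau

% Assumed normalization: mu_r and sigma_r are the mean and standard
% deviation of the residuals r_{ij}^{t}
w_{ij} = \left| \frac{(f_j^{\tau} - \tilde{f}_j^{\tau}) - \mu_r}{\sigma_r} \right|
```

With an intercept in the model, the least-squares residuals average to zero, so $\mu_r = 0$ and the weight reduces to the deviation at time $\tau$ measured in units of $\sigma_r$.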



Arabidopsis network analysis 

The aim of the analysis is the identification of the most central nodes (i.e., genes) within each
of the six parenclitic networks. Indeed, when a node is strongly central, it is highly connected, and
therefore it belongs to many pairs of nodes deviating from the expected models.

Due to the characteristics of the network, we have opted for the α-centrality measure,
according to which the centrality of a node is a linear combination of the centralities of the nodes to
which it is connected. If we define a vector $X$ of centralities such that its $i$-th component $x_i$ is
the centrality of the $i$-th node, we have:

$$(W + \alpha) X = \lambda X. \qquad (3)$$

Here, $W$ is the weight matrix of the network, and its element $w_{ij}$ encodes the weight of the
link connecting nodes $i$ and $j$. Notice that this is equivalent to an eigenvalue problem, with the constant
$\alpha$ defining weak connections between all the nodes of the network. In order to have meaningful
results, $\alpha$ should be smaller than the spectral radius of $W$.
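
The eigenvalue problem above can be solved numerically with a few lines of NumPy. The sketch below is illustrative only. It reads "W + α" as adding the constant α to every entry of W, following the remark that α defines weak connections between all nodes, and it rescales the result so that the most central node has centrality 1. Both choices are assumptions, and the function name `alpha_centrality` is hypothetical.

```python
import numpy as np

def alpha_centrality(W, alpha):
    """Leading eigenvector of (W + alpha), rescaled to a maximum of 1.

    W:     square array of non-negative link weights.
    alpha: small positive constant added to every entry of W (assumed reading).
    """
    W = np.asarray(W, dtype=float)
    M = W + alpha  # broadcasting adds alpha to every element
    eigenvalues, eigenvectors = np.linalg.eig(M)
    leading = np.argmax(eigenvalues.real)
    # For a positive matrix the leading eigenvector has entries of one sign,
    # so the absolute value only removes the arbitrary global sign.
    x = np.abs(eigenvectors[:, leading].real)
    return x / x.max()

# Star-shaped toy network: node 0 is linked to nodes 1, 2 and 3.
W = np.zeros((4, 4))
W[0, 1:] = 5.0
W[1:, 0] = 5.0
print(alpha_centrality(W, alpha=0.1))
```

In this toy network the hub (node 0) receives the highest centrality and the three leaves share the same lower value, which mirrors the way hubs are singled out in the parenclitic networks.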



Osmotic stress tolerance test 

For the screening of the transcription factors identified by the parenclitic model, the Arabidopsis
thaliana inducible lines from the Transplanta collection were used, with the ecotype Columbia
(Col-0) as the wild type. Each one of the transgenic Arabidopsis lines of the collection expresses
a single Arabidopsis transcription factor under the control of the β-estradiol inducible promoter.

For osmotic stress screening, seeds from control plants (Col-0) and at least two independent
T3 homozygous transgenic lines (Transplanta collection) of each transcription factor were
sterilized, vernalized for 2 days at 4°C and plated onto Petri dishes containing ½ MS medium
supplemented with 10 μM β-estradiol. After 5 days, seedlings were transferred to vertical plates
containing ½ MS medium supplemented with 300 mM mannitol and 10 μM β-estradiol, and transferred
to a growth chamber at 21°C under long-day growth conditions (16/8 h light/darkness). After 12
days pictures were taken to record the phenotypes, and root elongation measurements were
performed with ImageJ software.
Time step   Gene        Name                                        Centrality
30 min      AT1G13300   HRS1                                        0.88111
30 min      AT5G51910   TCP family transcription factor             0.729679
30 min      AT4G23750   CRF2, Cytokinin response factor 2           0.507826
1 h         AT1G44830   DREB                                        1.0
1 h         AT3G12820   MYB10                                       0.236686
3 h         AT2G46830   ATCCA1, CCA1, Circadian clock associated 1  0.271497
3 h         AT5G62320   MYB99                                       0.177404
3 h         AT1G29160   COG1                                        0.148112
6 h         AT4G16610   C2H2-like zinc finger protein               0.76//85
6 h         AT2G44910   ATHB-4                                      0.689358
12 h        AT3G61910   NST2                                        0.264721
24 h        AT1G09540   MYB61                                       0.709785
24 h        AT2G40950   ATBZIP17, BZIP17                            0.551008
24 h        AT5G62320   MYB99                                       0.482752
24 h        AT5G04410   ANAC078                                     0.438538

Table 1: Genes previously unknown in the literature, discovered by the parenclitic
representation, and that have been experimentally proven to develop a statistically significant
phenotype. The rightmost column reports the corresponding α-centrality values.
Figure 1 Schematic representation of the parenclitic network reconstruction method. (a)
The initial data set, for a set of three features, corresponds to a set of points (green
spheres) in a 3-dimensional space. The constraint surface (gray wired surface) represents
the overall standard relationship of the class. A generic unlabeled subject is represented
by a red sphere. (b) Each one of the three possible planes is considered, and the data are
projected onto it. The green dashed lines represent the models extracted in each plane.
The red points are the positions of the unlabeled subject, and the red lines indicate the
distance of the subject from the models. (c) The resulting parenclitic representation is a
network where nodes are associated with features, and links are weighted according to
the calculated distances (coded, in this Figure, into different line widths).

Figure 2 Parenclitic network for the response of Arabidopsis thaliana to osmotic stress
after 3 h. (a) Representation of the giant component of the network; for the sake of clarity,
links with weight lower than 3 have been eliminated. (b) Magnification of the neighborhood
of the most central node, AT1G12610. In both cases, color represents the link weight (from
green to red), and node size is associated with the corresponding α-centrality.

Figure 3 In vivo experimental verification of the predictions from the parenclitic analysis.
(a) Mean root length corresponding to the wild type (black column) and to 7
transgenic lines in which a specific gene has been knocked out. Whiskers represent the
standard deviation corresponding to each group. Asterisks denote groups for which the
distribution of root lengths differs from that of the wild type at the 0.01 significance
level. (b) Photos of one plant of each of the 8 lines at the end of the full development
process. (c) and (d) Photos of two vertical plates where plants are grown. In both cases,
the left (right) photos refer to wild phenotypes (to phenotypes developed by the transgenic
line).



Figure 4 Screening of the experimental results. Bars account for the 20 most central
genes at each time step. For the six time steps considered, bar colors are coded according
to the following convention: (green) the number of previously unknown genes that
have been experimentally verified to develop a statistically significant phenotype; (red)
the number of previously unknown genes that, once tested, failed to develop a phenotype
significantly different from the wild phenotype; (cyan) the number of
genes predicted by the parenclitic analysis that were previously described in the literature;
and (gray) the number of previously unknown genes that could not be tested
experimentally.